Actor-Critic Policy Optimization in Partially Observable Multiagent Environments
Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
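The regret-minimization foundation mentioned in the abstract can be illustrated with plain regret matching in self-play on a small one-shot zero-sum matrix game. This is a minimal sketch of a standard technique, not the paper's actor-critic method; the payoff matrix, the function names, and the iteration count are assumptions chosen purely for illustration.

```python
# Hypothetical 2x2 zero-sum game: entries are the row player's payoffs,
# the column player receives the negation. Its equilibrium is not uniform.
PAYOFF = [[2.0, -1.0], [-1.0, 1.0]]


def regret_matching(cumulative_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    # No positive regret yet: fall back to the uniform strategy.
    return [1.0 / len(cumulative_regret)] * len(cumulative_regret)


def self_play(payoff, iterations):
    """Run regret matching for both players; return their average strategies."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        x = regret_matching(row_regret)
        y = regret_matching(col_regret)
        # Expected payoff of each pure action against the opponent's current mix.
        row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * x[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_ev = sum(x[i] * row_values[i] for i in range(n_rows))
        col_ev = sum(y[j] * col_values[j] for j in range(n_cols))
        for i in range(n_rows):
            row_regret[i] += row_values[i] - row_ev
            row_sum[i] += x[i]
        for j in range(n_cols):
            col_regret[j] += col_values[j] - col_ev
            col_sum[j] += y[j]
    return ([s / iterations for s in row_sum],
            [s / iterations for s in col_sum])


def exploitability(payoff, x, y):
    """Sum of both players' best-response gains against (x, y)."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = min(sum(payoff[i][j] * x[i] for i in range(n_rows))
                   for j in range(n_cols))
    return best_row - best_col


if __name__ == "__main__":
    x_avg, y_avg = self_play(PAYOFF, 20000)
    print(x_avg, y_avg, exploitability(PAYOFF, x_avg, y_avg))
```

In a two-player zero-sum game, the exploitability of the average strategies equals the sum of the two players' average regrets, so it shrinks as the regrets do. The sketch is tabular and one-shot; it says nothing about the function-approximation setting the paper evaluates.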
Reviews: Actor-Critic Policy Optimization in Partially Observable Multiagent Environments
Specifically, it establishes the connection by defining a new variant of actor-critic that performs an exhaustive policy evaluation at each stage (denoted policy-iteration-actor-critic), together with an adaptive learning rate. In this setting, the paper argues that the actor-critic algorithm essentially minimizes regret and converges to a Nash equilibrium. The paper proposes several new policy gradient update rules (Q-based Policy Gradient, Regret Policy Gradient, and Regret Matching Policy Gradient) and evaluates them on multiagent zero-sum imperfect information games. To my understanding, Q-based Policy Gradient is essentially an advantage actor-critic algorithm (up to a transformation of the learned baseline).
3. The authors mention a "reasonable parameter sweep" over the hyperparameters. I'm curious about the stability of the proposed actor-critic algorithms across trials.
4. The paper should be proofread again.
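The reviewer's remark that Q-based Policy Gradient resembles advantage actor-critic can be made concrete with a small single-state example. The sketch below is an assumption-laden illustration, not the paper's definition of any of its update rules: it takes a tabular softmax policy over hypothetical logits and hypothetical Q-values, sums the gradient over all actions, and checks that subtracting the policy's expected value as a baseline leaves that gradient unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def all_actions_gradient(logits, q_values, use_baseline):
    """Gradient w.r.t. the logits of sum_a pi(a) * (q(a) - baseline).

    The baseline is treated as a constant; when use_baseline is True it is
    the expected value sum_b pi(b) * q(b), so the weights are advantages.
    """
    pi = softmax(logits)
    baseline = 0.0
    if use_baseline:
        baseline = sum(p * q for p, q in zip(pi, q_values))
    n = len(logits)
    grad = [0.0] * n
    for k in range(n):
        for a in range(n):
            # Softmax Jacobian: d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k)).
            dpi = pi[a] * ((1.0 if a == k else 0.0) - pi[k])
            grad[k] += dpi * (q_values[a] - baseline)
    return grad


if __name__ == "__main__":
    logits = [0.2, -0.1, 0.5]      # hypothetical policy parameters
    q_values = [1.0, -0.5, 0.3]    # hypothetical critic estimates
    print(all_actions_gradient(logits, q_values, use_baseline=False))
    print(all_actions_gradient(logits, q_values, use_baseline=True))
```

Both calls print the same vector because the action probabilities sum to one, so their gradients sum to zero and any action-independent baseline cancels. For a softmax policy the k-th component reduces to pi(k) times the advantage of action k, which is the sense in which an all-actions Q-based update and an advantage-weighted update coincide in this toy setting.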
Actor-Critic Policy Optimization in Partially Observable Multiagent Environments
Srinivasan, Sriram, Lanctot, Marc, Zambaldi, Vinicius, Perolat, Julien, Tuyls, Karl, Munos, Remi, Bowling, Michael